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Abstract 

Background: The pig is an economically important food source, amounting to approximately 40% of all meat 
consumed worldwide. Pigs also serve as an important model organism because of their similarity to humans at the 
anatomical, physiological and genetic level, making them very useful for studying a variety of human diseases. A 
pig strain of particular interest is the miniature pig, specifically the Wuzhishan pig (WZSP), as it has been extensively 
inbred. Its high level of homozygosity offers increased ease for selective breeding for specific traits and a more 
straightforward understanding of the genetic changes that underlie its biological characteristics. WZSP also serves 
as a promising means for applications in surgery, tissue engineering, and xenotransplantation. Here, we report the 
sequencing and analysis of an inbreeding WZSP genome. 

Results: Our results reveal some unique genomic features, including a relatively high level of homozygosity in the 
diploid genome, an unusual distribution of heterozygosity, an over-representation of tRNA-derived transposable 
elements, a small amount of porcine endogenous retrovirus, and a lack of type C retroviruses. In addition, we 
carried out systematic research on gene evolution, together with a detailed investigation of the counterparts of 
human drug target genes. 

Conclusion: Our results provide the opportunity to more clearly define the genomic character of pig, which could 
enhance our ability to create more useful pig models. 
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Background 

Domestic pigs belong to the mammalian clade Artiodac- 
tyla, a group of even-toed, hoofed animals whose extant 
representatives include ruminants such as cattle and sheep. 
In contrast to ruminants, pigs are omnivores and can easily 
adapt to changes in diet, and possess a digestive system 
that is simple, anatomically and physiologically distinct 
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from ruminant stomachs. As one of the oldest forms of 
livestock, pigs were domesticated as early as 8,000-10,000 
BC from Eurasian wild boars [1]. Pigs are commonly raised 
for meat, which is the most important animal protein food 
source, and feed a majority of the global population. Other 
pig derivatives include industrial materials such as 
pharmaceutical-grade heparin, which is mostly derived 
from mucosal tissues of the pig small intestine. 

Unlike other domesticated animals such as cattle and 
sheep, pigs were mainly raised in agricultural societies 
and settled farming communities, rather than by no- 
madic people. After a long period of breeding outdoors 
in yards or fields, pigs evolved eating patterns resem- 
bling those of human beings. The similarity dietary 
structure, as well as the close resemblance of pigs' 
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Table 1 Global statistics of the pig genome 
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digestive organs to those in humans (i.e., the stomach, 
pancreas and small intestine), enable pigs to develop di- 
gestion, nutrient absorption, metabolism and intestinal 
microflora in common with human beings. Pigs also 
share a number of diseases with humans, such as obes- 
ity, atherosclerosis, cardiovascular disease, gastroentero- 
pathy and immunological diseases [2-4]. It is therefore 
possible to consider the pig as a unique pharmacology 
and toxicology model for the investigation of human 
health. Until now, pigs have been employed in studies 
involving 38 kinds of human disease, including cardio- 
vascular and metabolic diseases [2]. In addition, pigs and 
humans share similarities in the size of organs as well as 
various other aspects of anatomy and physiology, mak- 
ing pigs the most promising candidate for development 
of new surgical procedures, tissue engineering techni- 
ques and xenotransplantation [5,6] . 

Of all members of the pig family, the miniature pig 
contains significant breeding and handling advantages, 
and has been proven to be particularly valuable in bio- 
medical research. There are 72 native breeds of pig in 
China, accounting for about one third of worldwide 
breeds, including four indigenous miniature pigs, the 
Wuzhishan, Xiang, Diannan small-ear and Tibetan 
breeds [7]. This exuberant resource of pig breeds pro- 
vides a large variety of genotypes and phenotypes, facili- 
tating the use of pigs as models of different human 
diseases. Among these species, WZSP, characterized by 
its small adult size with a mature body weight of only 
approximately 30 kg, is one of the rare and endangered 
breeds previously distributed in the mountain area of 
Hainan province, China. Since 1987, the Institute of Ani- 
mal Science of the Chinese Academy of Agriculture 



Science (CAAS) has developed a highly inbred strain 
based on the inbreeding of one male and one female 
WZSP by full-sib mating [8-10]. This inbred strain of 
miniature pig, with a relatively high level of homozygosis 
and genetic stability, provides us with genetically identi- 
cal test animals to achieve good reproducibility in la- 
boratory experiments. However, the detailed genomic 
structure of this strain is still unknown. Since a clear 
genetic background is of crucial importance in develop- 
ing an effective animal model, although the genome se- 
quence of Duroc swine generated by the International 
Swine Genome Sequencing Consortium (SGSC) has 
been publicly available for years [11], it is still short of 
in-depth analysis and understanding of the genome. 

Here we report the sequencing and analysis of an 
inbreeding WZSP genome, which reveals unique genomic 
features, including the over representation of tRNA- 
derived transposable elements consisting of approximately 
2.2 million copies accounting for 12.4% of the genome, as 
well as a relatively high degree of homozygosis of the 
diploid genome and its unusual distribution of heterozy- 
gosis. In addition, we investigated the counterparts of 
human drug target genes and genes associated with dis- 
ease in the pig. Our analysis reveals that the pig resembles 
human closely, but attention should be drawn to the dif- 
ferences between human and pig when using pigs as an 
animal model. The reported genome, together with our 
detailed analysis, sheds light on our understanding of the 
pig genome and its evolution, increasing our understand- 
ing of human health, and enhancing possibilities for the 
creation of useful pig models. 

Data description 

Genomic DNA was isolated from peripheral blood of a 
male WZSP and then used to construct Solexa libraries 
with various insert sizes ranging from 170 bp to 20 Kbp 
(including 170 bp, 350 bp, 500 bp, 800 bp, 2 Kbp, 5 Kbp, 
10 Kbp and 20 Kbp). The length of reads generated from 
short insert size libraries (170-800 bp) and large insert 
size libraries (>2 Kbp) were 100 bp and 50 bp respect- 
ively. A total of 340 Gbp or 126-fold coverage of raw 
paired-end data were generated from these libraries. 
After removing the duplicated and low quality reads, 
210 Gbp or 78-fold coverage of data was retained for as- 
sembly (Table 1, Supplementary Text 1.1, Table SI in 
Additional file 1). Kmer-based analysis estimated the size 
of genome at approximately 2.5 Gbp with a very low sig- 
nature of heterozygosity of the diploid genome. The high 
quality reads were collected and assembled using SOAP- 
denovo [12]. Reads from short-insert size libraries were 
used to build contigs and all libraries were then used for 
scaffolding, from smallest to largest insert-size libraries, in 
a stepwise process. Gaps in the scaffolds were filled by 
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unambiguous local assembly of unmapped reads with one 
end uniquely mapped to the flanking regions of the gaps 
(Supplementary text 13 in Additional file 1). This process 
resulted in an assembly of 2.6 Gbp, and the N50 size of 
contigs and scaffolds were 23.5 Kbp and 5.4 Mbp, respect- 
ively (Table 1). To assess the completeness of the assembly, 
reads from short-insert size libraries were re-mapped onto 
the assembly. Approximately 98% of the reads were map- 
pable, suggesting our assembly is complete and that most 
of the data were represented. Supporting data, genome as- 
semblies and annotation files are available from the Giga- 
Science database [13]. 

Analysis 

Heterozygosis of inbred diploid 

Twenty generations of inbreeding should result in a 
high-level homozygosis of the diploid genome, but a pre- 
vious study of this pedigree identified a certain genomic 
region where a high rate of polymorphism was main- 
tained [14-15]. With the genome sequence in hand, we 
were able to investigate genomic regions with unusually 
high rates of homozygosis or heterozygosis. Polymorph- 
ism in the diploid genome, including single nucleotide 
polymorphism (SNP) and short insertions and deletions 
(short InDels) were identified by investigating the short 
read alignment, using the assembly as a reference. We 
detected 2.8 M SNPs and 392 K short InDels, resulting in 
a heterozygous SNP rate of 0.118% and short InDel rate of 
0.017% (a combined rate of 0.135%), slightly higher than 
in human genome and lower than a highly inbred Iberian 
pig [16]. Although it is well known that the genomic di- 
versity of pigs in China is higher than in other populations 
[17], considering the inbreeding, the diversity within the 
WZSP is unexpectedly high. We further analyzed 17 in- 
bred mice [18] and found correlation coefficients (r) be- 
tween observed heterozygosis and expected heterozygosis 
based on estimated inbreeding coefficients that are close 
to zero (r<0.05). Although a simple examination, it sug- 
gests that estimating heterozygosis based on pedigree- 
derived inbreeding coefficients might not be a very good 
indicator of genomic heterozygosis, since genomic hetero- 
zygosis could be subject to various factors such as muta- 
genesis, recombination, selection and gene flow among 
other demographic factors. 

The distribution of heterozygosis along the assembly 
was studied by a 50 Kbp non-overlapping sliding window. 
For comparison, the heterozygosis of human, naked-mole 
rat (NMR) genomes [19] and the inbred mouse at F280 
[18] were also investigated. Human beings are outbred, 
while NMRs are proposed to be naturally inbred due to 
their eusocial behavior; a colony is produced by a queen 
mating with a few breeding males. The distribution of het- 
erozygosis in WZSP is quite different compared to human 



and NMR genomes, but more like the inbred mouse gen- 
ome (Figure 1). The heterozygosis between two haploids 
of human, NMR, WZSP and inbred mouse were 0.10%, 
0.07% [19], 0.13%, and 0.01% respectively. Although it is 
the highest in WZSP, 60% of the WZSP genome showed 
extremely low heterozygosis (< 0.01%), compared with 
only 11% in NMR and 8% in human, suggesting that long 
term of inbreeding has purged a large proportion of het- 
erozygosity. Most regions of the human and NMR gen- 
omes share a similar rate of heterozygosis, a sharp 
contrast to the WZSP genome where it varied extensively, 
indicating that the effect of purification by inbreeding 
varies between different genomic loci. Highly heterozy- 
gous regions imply resistance to purification during the 
inbreeding process, which may be due to recombination, 
genetic drift and mutation, or other unknown mechan- 
isms. One possibility is that homozygosis in these regions 
may lead to failed fertility, abnormal development or 
death, and thus excluded from our observation. Investiga- 
tion of these genomic regions was performed to shed light 
on which genes remained heterozygous during inbreeding. 
We first focused on the top 5% of divergent regions with a 
heterozygous rate ranging from 0.55 - 1%, and found that 
795 genes were shown to be located within these regions, 
and genes annotated with binding function were signifi- 
cantly enriched (p-value < 0.05). We also calculated the 
heterozygous rate of each coding gene, and a func- 
tional category enrichment analysis was performed 
focusing on the 1,000 most heterozygous genes with 
a heterozygosis of at least 0.48%. Olfactory receptors, 
G-protein coupled receptors, and other genes involved in 
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signaling pathways were significantly over represented in 
this gene set (Table S3 in Additional file 1). The high diver- 
gence of genes in signaling pathways may be needed to 
maintain the elaborate and sophisticated regulating system, 
and thus cannot be purged during inbreeding. Since both 
recombination and selection have influence on heterozygo- 
sis, we studied the correlation between recombination and 
heterozygosis in these high-heterozygosis genes and found 
no observable correlation (correlation coefficient < 0.05). A 
further population study of this family will enable us to dis- 
tinguish whether such heterozygous genes resulted from se- 
lection or recombination precisely. 

Among the 2.8 M total SNPs identified, 20 K, 616 K 
and 2.16 M were located in the exon, intron and inter- 
genic regions, resulting in SNP rates of 0.07%, 0.11% and 
0.12% respectively (Table S4 in Additional file 1). The 
SNP rate in coding regions was much lower compared 
to non-coding regions, consistent with selection con- 
straint. The ratio of transition/transversion was 2.54 and 
the ratio of synonymous to nonsynonymous SNPs was 
0.44, comparable to human and other mammalian gen- 
omes. For the 39 K short InDels, only 348 (or 0.09%) 
were located in coding sequences (CDS) affecting a total 
of 311 genes. It is reasonable that less InDels are present 
in CDS since they may lead to loss-of-function due to 
frame shift in the triplet-based codon. However, we 
found 202 frame-shift mutations caused by InDels, most 
of which were associated with olfactory receptors, sug- 
gesting fast evolution and a high rate of gene birth and 
death within this category. 

Transposable elements in the pig genome 

Transposable elements (TEs) are known to be well 
represented in vertebrate genomes and play an import- 
ant role in shaping genome structure. Their high activity 
in transposition and recombination usually leads to 
structural variations, and is associated with gene birth 
and death. Therefore, it is vital to understand the char- 
acteristics of repeat elements in a newly sequenced gen- 
ome. Based on the mechanism of transposition, TEs can 
be classified into either DNA transposons, which result 
from duplication, or retrotransposons, which are gener- 
ated through retrotranscription via an RNA intermedi- 
ate. Retrotransposons can be further classified into long 
terminal repeats (LTRs), long interspersed elements 
(LINEs) and short interspersed elements (SINEs). 

Repetitive elements in pigs have been extensively docu- 
mented through limited data or via experimental method- 
ology [20,21]. By searching against the RepBase using 
Repeatmasker [22], 38.2% of the assembly was identified 
as TEs, more than horse (~ 32%) but less than cattle 
(~ 42%). Retrotransposons are the most prevalent TEs and 
constitute 36% of the WZSP genome, while only 2.2% of 
the genome was identified as DNA transposons. Among 



the retrotransposons, 18% of the genome consisted of 
LINEs, making it the most abundant category in the pig 
genome, but the proportion is still less than that in cattle 
and horse genomes (Table S5 in Additional file 1). Similar 
to other mammalian genomes, LINE/Ll is the most abun- 
dant class in the pig genome, accounting for 16.8% of the 
total genome size, comparable to 17% in horse and 18% in 
human, but slightly more than the 12% present in cattle. 
Notably, SINEs occupy 13.6% of the pig genome, making 
it the most SINE-rich species comparing to cattle (9.7%), 
horse (2.5%) and human (11.3%), suggesting SINEs were 
more active in the pig. However, not all SINE subfamilies 
were well represented in the pig genome. We found that 
SINE/tRNA was the most abundant class, including over 
2.2 million copies with a total size of 325 Mbp, accounting 
for 12.4% of the genome (comparable with a previous re- 
port [23]), which is even more than the well known Alu 
elements derived from 7SL RNA in humans, estimated at 

I million copies and occupying 11.4% of the human gen- 
ome. The length of SINE/tRNA elements ranged from 

II bp to 2,028 bp, but 77.8% of them ranged from 100- 
300 bp in length. The distribution of length revealed three 
peaks, the main peak located around 263 bp with smaller 
peaks at 106 bp and 55 bp. This is different from Alu in 
the human genome, which has a typical size of 300 bp 
(Figure S2 in Additional file 1). 

Based on sequence similarity, the 2.2 million copies of 
pig SINE/tRNA can be divided into 47 groups; the con- 
sensus of each group was inferred based on multiple se- 
quence alignments. The number of members in each 
group ranged from 1 to 887,807 and the top five biggest 
groups comprised more than 70% of the total SINE/ 
tRNA sequences. In addition, 1.98 million or 88.5% TEs 
belong to tRNAglu-derived PREl class, which can be 
further divided into 14 subclasses. PRElb, PREla, PRElj, 
PREld2, PRE Ik are the most abundant with more than 
100,000 copies in each subclass. Only a few copies of 
PREl can be found in cattle, and none of them can be 
detected in human and rodent genomes (Table S7 in 
Additional file 1). This result indicated a PREl expan- 
sion after speciation from the last common ancestor of 
pig and cattle. PREl was proposed to originate from the 
CHRS-S family [24], and it is widely distributed in Pha- 
cochoerus aethiopicus (warthog) and Tayassu tajacu 
(peccary), suggesting an expansion prior to speciation of 
the ancestral pig. The origin of PREl is estimated to be 
at least 43.2 Million years ago (Mya) [25]. Considering 
the divergence of cattle and pigs is estimated to have oc- 
curred around 54.1 Mya, the expansion of PREl can be 
narrowed down to between 43.2 and 54.1 Mya. 

Pair-wise comparison of transposable elements and 
their inferred consensus sequences were also investi- 
gated. The bimodal distribution of divergence with two 
peaks at ~ 20% and ~ 10% (Figure 2A) indicates that pig 
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Figure 2 Divergence distribution of SINE/tRNA in the pig genome. (A) The divergence distribution of all SINE/tRNA, and (B) the divergence 
distribution of classified SINE/tRNA subfamilies. 



has undergone two rounds of SINE/tRNA expansion. 
The divergence of each of PREl subclass was also stud- 
ied (Figure 2B), illustrating a similar pattern among all 
SINE/tRNA examined. It is clear that the peak at 10% 
was almost exclusively due to recent expansions of 
PREla and PRElb. 

Gene prediction and evolution 

To understand the protein coding genes in the pig gen- 
ome, homologous searching and transcription evidence- 
based prediction were performed. Protein sequences 
from closely related mammals and expressed sequence 
tags (ESTs) were used as query sequences to search for 
their counterparts in the assembly. We identified 20,326 
genes. The average length of gene, exon and intron are 
approximately 29.4 Kb, 1.5 Kb and 3.5 Kb respectively, 
which is similar to cattle (33.6 Kb, 1.5 Kb and 3.9 Kb) 
and horse (32.3 Kb, 1.5 Kb and 3.7 Kb). Based on the re- 
ciprocal best BLAST alignment, 16,564, 17,475 and 16,923 
orthologous groups were identified in pig/human, pig/cat- 
tle and pig/horse, respectively, and the distribution of se- 
quence similarities of orthologs revealed that the pig is 
most closely related to cattle with an average amino acid 
identity of 85.9%, while sharing an average identity of 
84.1% with human (Figure S3). 

A TreeFam-based gene family analysis was conducted 
to study gene family evolution and estimate the diver- 
gence time of pig from other sequenced mammals. Gene 
sequences of human, horse, dog, cat, cattle, rat and 
mouse were used in this analysis (see Supplementary 
Text 3.2 in Additional file 1). We found that 18,814 or 
92.6% of pig genes can be assigned into 9,360 gene fam- 
ilies, at least with one other species. A total of 3,211 sin- 
gle copy gene orthologous groups were obtained, and 
sequences for each species were concatenated into a 



super gene to infer the phylogenetic tree (Figure 3). As 
expected, pig and cattle clustered together to represent 
Artiodactyla - both sharing a last common ancestor ap- 
proximately 51.4 Mya. 

To obtain better insight into the dynamic changes of 
evolution, gene family expansion and contraction were 
investigated by considering gene birth and death in an 
evolutionary framework [26]. By comparing the pig gen- 
ome to cattle genome, which is its closest relative, we 
identified 27 expanded gene families (810 genes) while 
17 gene families (169 genes) showed contraction. A typ- 
ical example of a gene family expansion points to the 
functional categories including drug metabolism and 
synthesis of lipids, in which six copies of Cyp2J2 were 
detected and originated from tandem duplication. A 
gene family related with somatotropin hormone function 
was contracted in the pig, which might underlie the 
smaller adult size of this inbred mini pig. Additionally, 
240 orphan genes were identified with no counterpart in 
any other organisms under current criteria. They were 
clustered into 39 groups, representing pig-specific genes 
which may be due to gene gain or loss, or fast evolution, 
thus suggesting functional changes. Olfactory receptors 
and signaling-related genes were enriched in pig specific 
families (Table Sll in Additional file 1), suggesting fast 
evolution and/or unique regulation in the pig lineage. 
Genes of viral origin were also over represented in the 
pig-specific gene set due to the presence of porcine en- 
dogenous retroviruses (PERVs), which are known to be 
integrated into the pig genome. 

A comparison of gene content between human and 
pig allows us to identify genes gained and lost during 
evolution (see Supplementary text 3.2 in Additional file 2). 
In total, 245 genes were identified as gained in pig 
(Table S12 in Additional file 2), and may be associated 



Fang ef al. Giga Science 2012, 1:16 
http://www.gigasciencejournal.conn/content/1/1/16 



Page 6 of 1 1 



|— 76.0(68.9-S8.0) 



S03(72.7-93.0) 



■9(564.59.4) 



P.3(94 6-108.3) 



151 4(48 1-54.1) 



>5. 1(86.6-100.9) 



'25.0(17 8-40 4) 



Horse 
Dog 
Cat 
Cattle 

Pig 

Human 

Rat 

Mouse 



60 



40 



20 



_j Million yean ago 



Figure 3 Estimation of divergence time. The time of divergence (witli error range sliown in parentlieses) of pig and seven otiier mammals 
based on ortliology relationsliips. Distances are sliown in millions of years. 



with the emergence of specific functions and physiology. 
For instance, the gene WZSPO 10943 specifically existed 
in the pig genome compared to human, which encodes 
the interferon-induced transmembrane protein, and is 
linked with the proliferation, migration and invasion of 
glioma cells. It has also been proposed as a potential 
therapeutic target for gliomas [27]. In addition, our ana- 
lysis identified 270 lost genes compared to human 
(Table S13 in Additional file 2); however, this may result 
from human gain or pig loss, as well as high sequence 
diversity, thus it is filtered under current criteria. For 
example, gene LDHAL6A was lost in the pig genome, 
but exclusively expressed in human testis, indicating 
human gain or pig loss events occurred after the diver- 
gence of these two species. Particular care should be 
taken if using pig as a model for studying human biol- 
ogy associated with these genes. We found that PROZ, 
the gene encoding protein Z (a member of blood pro- 
teins that leads to the formation of blood clots and 
thrombosis [28]), is lost in pig. The CETP gene is also 
lost and encodes the cholesteryl ester transfer protein - 
a plasma protein that facilitates the transport of choles- 
teryl esters and triglycerides between the lipoproteins. 
Heightening interest in CETP has been stimulated due 
to the discovery that CETP inhibitors are intended to re- 
duce the risk of atherosclerosis by improving blood lipid 
levels, thereby benefiting patients with cardiovascular 
disease [29-31]. Pyridoxal phosphatase (encoded by the 
gene PDXP) is also lost (Table S13 in Additional file 2), 
and participates in vitamin B6 metabolism, acting as a 
catalyst in the hydrolysis of pyridoxal phosphate [32] . 

Pseudogenes are DNA sequences that resemble func- 
tional genes but are generally thought to have lost func- 
tion, implicating a biological and evolutionary story 
behind the sequences. By mapping human protein 
sequences against the pig assembly, we detected 105 



pseudogenes in pig (Table S14 in Additional file 2). 
Among these genes, some play an important role in or- 
ganism development and physiological processes, includ- 
ing UCPl, AGR3, CLDM6, NMBR, KCNK18, GANC and 
CES2, For example, Pseudogene UCPl, which was dis- 
rupted about 20 million years ago, provides an explan- 
ation for the lack of brown adipose tissue and poor 
thermoregulation in piglets [33]. Pseudogene KCNK18 
inactivates the potassium channel subfamily K member 
18, which may help trigger pain centers in the brain and 
cause severe headaches [34,35]. Another pseudogene 
GANC inactivates the neutral alpha-glucosidase, a key 
enzyme involved in glycogen metabolism and associated 
with susceptibility to diabetes [36] . Interestingly, pseudo- 
gene NMBR was only found in the WZSP inbred-line, 
not in Bama miniature pig or large white pig, as con- 
firmed by PGR validation. NMBR belongs to the GPCR 
family, and is a receptor for Neuromedin B {NMB), 
which is a mammalian homolog of amphibian bombesin. 
The NMB/NMBR pathway is involved in the regulation 
of a wide variety of physiological processes and beha- 
viors, such as thermoregulation, cell growth, food intake 
and anxiety- related behavior [37]. Disruption of the neu- 
romedin B receptor gene results in dysregulation of the 
pituitary-thyroid axis and partial resistance to diet- 
induced obesity, as well as a decrease in burying behav- 
ior through increasing levels of serotonin in the brain 
[38,39]. Lack of functional neuromedin-B receptor in 
WZSP may have functional implications for the unique 
physiology and behavior of WSZ inbred pigs. Here we 
provide the list of pseudogenes for further study by the 
research community. 

Genes that have undergone positive selection can pro- 
vide very useful pointers to the adaptation process dur- 
ing recent evolution. In this study, we identified 19 
genes with evidence of positive selection (Table S15 in 
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Additional file 2), including five DNA-repair and cell 
cycle control-related genes {LSMIO, APLF, TP53I13, 
NEILS, CDKN3), and five genes involved in organism de- 
velopment (COGi, CHGB, GLIS2, FECH, STK16), One 
example of these positively selected genes is CHGB, 
which encodes chromogranin B, a neuroendocrine 
secretory granule protein that has an effect in reducing 
the availability of glucose and lowers the risk of cancer 
during aging [40]. Up-regulation of chromogranin B has 
been observed in dwarf mice and fasting control mice 
[41]. Therefore, the fast evolution of this gene might 
yield insights into dwarfism. Another gene under posi- 
tive selection is GLIS2, which encodes the GLIS family 
zinc finger 2, which plays an essential role in kidney de- 
velopment and neurogenesis [41]. 

Although it is not clear how to make the connection 
between the unusual evolutionary history of a gene and 
the unique traits of pigs without additional functional 
experiments, the data set showed can be a valuable re- 
source for further study. Most importantly, attention 
should be given to understanding the situation of target 
genes when using pig as an animal model for biomedical 
studies. 

Porcine endogenous retrovirus in the pig genome 

Many patients with end-stage organ failure are dying 
from worldwide shortages of human organ donors. 
Xenotransplantation was proposed to be an efficient al- 
ternative, and as humans' closest relatives, nonhuman 
primates were originally considered to be the best organ 
donor. However, ethical restrictions, shortages and 
endangered status of some species, together with the risk 
of pathogen transmission from nonhuman primates to 
human, have impeded their application. 

The similarities between humans and pigs at anatom- 
ical, physiological and genomic levels provide unique 
advantages for the use of pigs as a potential donor spe- 
cies for xenotransplantation [42]. Economic advantages 
and ethical considerations also promote their biomedical 
use. Furthermore, inbred WZSPs are particularly suitable 
for pig- to-human xenotransplantation [43], due to their 
physical size, as well as their inter- individual similarity. 
However, in addition to the immunological barriers, por- 
cine endogenous retroviruses (PERVs) created an obs- 
tacle for safe transplantation. PERVs are ancient viral 
sequences integrated into the pig genome, and transmit 
vertically to the offspring, making them very difficult to 
eliminate. Since in vitro studies indicated that PERV 
released from porcine cells may infect human cells, the 
potential risks have sparked great concern over the use of 
pigs in xenotransplantation [44-49]. Therefore, it is of 
great importance to investigate PERVs in the pig genome. 

A complete PERV genome includes gag, pol and env 
genes, as well as 5' and 3' LTRs. The gag gene encodes a 



core protein which is a group-specific antigen; pol 
encodes an intergrase, reverse transcriptase and prote- 
ase, which play critical roles in the virus life cycle; and 
env encodes envelope proteins that determine host trop- 
ism of the virus [50]. We screened the pig genome with 
the PERV proteins (gag, env, pol) as well as genome 
sequences, including PERV-A, PERV-B, PERV-C and 
PERV-C/A downloaded from NCBI (Supplementary text 
4 in Additional file 1). In total, we detected 182 copies 
of potential virus-derived genes in the assembly, and the 
product of coverage and identity between virus protein 
and pig encoding counterpart showed clear bimodal 
distribution (Figure 4A). Most of the virus-derived 
sequences were distantly related to queries, suggesting 
that many mutations were accumulated after the ancient 
integration event, whereas the remaining mutations were 
almost identical to virus sequences, indicating that those 
PERVs might still be active and replicating in the pig 
genome. Indeed, the polymorphic nature of PERV inte- 
gration sites [51] within the pig genome suggests the in- 
dependence of integration. When it comes to pig-to- 
human infection, more attention should be paid to active 
PERVs with full-length sequences. 

When Filtered with identity and coverage > 80%, we 
identified eight virus-derived protein coding genes, includ- 
ing 5 env, 1 gag and 2 pol (Table S16 in Additional file 1). 
Among them, 3 env, 1 gag and 2 pol genes were from 
PERV-A, and 2 env genes from PERV-B. Only 1 env 
gene from PERV-A was intact, and no PERV-C derived 
genes could be found. In addition, we detected two cop- 
ies of complete sequences of PERV integrated into the 
genome (Table S17 in Additional file 1). Figure 4B 
shows the evidence of the integration of a copy of PERV 
in the middle of the assembled scaffold, and the inte- 
grating points were well supported by paired-end reads. 

A previous study suggested that PERVs might be 
removed by breeding or gene knockout techniques [50], 
and great effort has been made with the aim of eliminat- 
ing PERV from the donor pig [62]. Since type C retro- 
viruses like MuLV, BaEV, GALV and HERV were able to 
cause disorders in the host, such as leukemia, carcinoma 
or immunodeficiency, the highly maintained conserva- 
tion of PERV-C [63] raises concerns of pig-to-human in- 
fection. The natural lack of PERV-C in the WZSP breed, 
together with the small amount of replication-competent 
PERV in its genome, might facilitate its application in pig- 
to-human xenotransplantation. 

Human druggable domain and disease associated 
orthologs in pig 

The similarities between humans and pigs make pigs at- 
tractive as an animal model for biomedical studies. Pre- 
existing pig models of human diseases include a broad 
array of biomedical topics, such as heart physiology. 
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Figure 4 PERVs in the WZSP genome. (A) Distribution of similarity between virus protein and its counterpart in WZSP genome by considering 
factors of sequence identity and coverage. (B) Structure of wliole PERV genome inserted into tine pig genome. Tine upper panel shows the 
sequencing depth of the assembled scaffold. The middle panel visualizes the paired-end alignment of the scaffold providing evidence of correct 
assembly. The bottom panel shows the genome structure of PERV and its integration site in the scaffold. 
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reproductive function, skin physiology, brain function, 
transplantation, gut physiology and nutrition, tissue engin- 
eering, respiratory function and infectious disease models, 
to name a few (for a review see reference [2]). It is worth 
noting that pigs are the most promising animal model for 
atherosclerosis and cardiovascular disease. More specific- 
ally, the cardiovascular anatomy and hemodynamics of 
pigs are comparable to humans; they develop human-like 
spontaneous and diet-induced atherosclerotic lesions, and 
can experience sudden death when under stress [54-57]. 
Unlike small animal models, the progression of pig ath- 
erosclerosis can be generally quite slow, and both athero- 
sclerotic plaque distribution and composition follows a 
pattern comparable to that of humans [58-60]. As mini- 
ature pigs, WZSPs are particularly suitable for biomedical 
research, due to their reasonable size and early maturity. 

Understanding the pig counterparts of human drug 
targets is vital for pre-clinical drug screening, using pigs 
as a model organism. We downloaded the drug target 
gene information from DrugBank [61], and identified 
the counterparts in mouse, macaque and pig genomes. 
The DrugBank information showed that 1,624 druggable 
human genes have an ortholog in at least one of these 
three species: 1,616 are shared between human, mouse, 
macaque and pig, and 1,618 genes were identified in pig. 
The distribution of protein sequence identity between 
human and pig genes shows a peak at 95% (Figure S6 in 
Additional file 1). Compared to mouse and macaque, pig 
has specifically lost three drug target genes (REGIA, PROZ 
and HSD3B2), but retains S100A12 and GNLY, which have 
been lost in the mouse lineage (Figure S7, in Additional 
file 1 Table S18 and Table S19 in Additional file 2). Com- 
paring the differences between drug target genes in 
human and their counterparts in pig will provide vital 



information for biomedical studies. For example, CYP3A4 
is a hepatic microsomal P450, responsible for the oxida- 
tive metabolism of over 50% of clinically relevant drugs. 
Two post-translational modification sites [62] (Thr264 
and Ser478) are responsible for phosphorylation and 
subsequent ubiquitin-dependent proteasomal degrad- 
ation in human. Interestingly, amino acid substitutions 
were discovered in these two modification sites in pig 
(T264Q and S478T), such substitutions may make this 
protein more stable for oxidative biotransformation of 
various endo- and xenobiotics, and further biomedical 
studies focusing on the CYP3A4 gene (using the pig as a 
model) should be aware of such modifications (Figure S8 
in Additional file 1). 

Amongst 247 coronary artery disease (CAD) related 
genes, only ARMS2 and CETP were lost in our assembly 
(Table S20 in Additional file 2). The loss of CETP 
explains why its activity was undetectable in a previous 
study [63] while ARMS2 is a primate-specific gene. We 
detected six copies of Cyp2J2 that may have resulted 
from tandem duplication, and phylogenetic analysis 
reveals its expansion in the pig and mouse lineages 
(Figure S9 in Additional file 1). Cyp2J2 is involved in 
various types of drug metabolism and synthesis of chol- 
esterol, steroids and other lipids, and can help increase 
functional recovery of cardiomyocytes [64]. This tandem 
duplicated cluster may have similar or divergent func- 
tionality and/or regulation, and may underlie the differ- 
ences seen in drug metabolism in the pig. 

Though pig and human share most CAD genes, we 
also found a few differences. For instance. Fibrinogen 
alpha chain {EGA) and fibrinogen beta chain (EGB) both 
function in fibrin formation and platelet aggregation. 
The proper functions require proper cleavage of its 
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protein sequences. Using the UniProtKB/Swiss-Prot 
database we found that the cleavage sites between R123 
& D124 in FGA [65] and K152 & D153, K163 & D164 in 
FGB [66] are recognized by plasmin and break down fi- 
brin clots. In the pig genome, we found amino acid sub- 
stitutions in these cleavage sites (R123D in FGA and 
D153R, K163R in FGB), Such modifications may result 
in a different pattern of cleavage of these two proteins, 
thus suggesting potential physiological differences in fi- 
brin clot degradation (Figure SIO in Additional file 1). 

Another example is plasminogen activator inhibitor 2 
(PAI'2), one drug target for tenecteplase (DrugBank ID: 
DB00031) and urokinase (DrugBank ID: DB00013). 
Three mutations (C5Y,C79G and C405N) were observed 
in the pig genome, and such substitutions may result in 
the loss of disulfide bonds (C5-C405 and C79-C161), 
which may prevent the polymerogenic conformation of 
PAI-2 [67] (Figure Sll in Additional file 1). 

Clear genetic information will enhance the possibilities 
of creating useful WZSP models, and may lead to a bet- 
ter understanding of the molecular mechanisms under- 
lying cardiovascular disease. 

Discussion 

Our investigation on heterozygosis of the WZSP inbred 
diploids revealed an unexpectedly high rate of poly- 
morphism maintained in certain genomic regions. Gen- 
omic regions with extremely low rates of heterozygosis 
(< 0.001%) account for 60% of the genome, indicating 
that of inbreeding has purged a large proportion of het- 
erozygosity from the genome. However, the mechanism 
to explain why high heterozygosis was maintained dur- 
ing inbreeding remains unclear. Recessive lethality may 
be one reason, but it cannot be the explanation for all 
heterozygous regions, which account for more than 30% 
of the genome. Recombination and artificial selection 
during inbreeding might be other factors to consider. Se- 
quencing of non-inbred WZSP as an out-group and the 
inbreeding pedigree will be interesting and should help 
us to understand how heterozygosis decreased or was 
maintained, and how chromosome structures reorga- 
nized during the inbreeding process from generation to 
generation. 

Prediction and annotation of transposable elements in 
the pig genome showed over representation of tRNA 
derived TE, with approximately 2.2 million copies 
accounting for 12.4% of the genome. According to previ- 
ous studies, SINE RNAs have been proven to impact 
gene expression and regulation [68-70]. The over repre- 
sentation of tRNA-derived SINEs in the pig genome may 
also play certain roles, not only in shaping the genome 
structure, but also in increasing the complexity of gene 
regulatory networks and population structure, which 
may have resulted from unequal SINE-SINE crossover 



by associating genes with new c/5-elements. The evolu- 
tion and influence of Alu in the human genome has 
been extensively studied, whereas the function of SINE/ 
tRNA in the pig genome still needs to be explored in the 
future. Most of the TEs are proposed to have escaped 
from selection, thus the Suidae- specific expansion of the 
TREl class is a valuable resource to understand the evo- 
lutionary history of Suidae, as well as the process of do- 
mestication and selective breeding. Further, population 
scale and functional genomics will deepen our under- 
standing of its role in shaping the pig genome. 

Based on our analysis of porcine endogenous retro- 
virus, we detected a limited number of active PERVs and 
a natural lack of PERV-C in the WZSP genome, which 
lowers the risk of pig-to-human infection during xeno- 
transplantation - highlighting the use of these pigs as po- 
tential organ donors. A detailed investigation on gene 
evolution in the pig, as well as research on the pig coun- 
terparts of human druggable domain and disease related 
genes, revealed that pigs strongly resemble human 
beings, but also demonstrates that attention should be 
paid to the differences between human and pig when 
taking pig as an animal model. 

Methods 

The genome was sequenced on the Illumina HiSeq™ 
2000 platform (Illumina, San Diego, CA, USA). The 
sequenced individual male WZSP was from the Institute 
of Animal Science of CAAS, Beijing, China. The genome 
was assembled using SOAPdenovo. See Supplementary 
Information for data analysis and additional details. 

Availability of supporting data 

The WZSP whole genome shotgun projects have been 
deposited at DDBJ/EMBL/GenBank under the accession 
number of AJKKOOOOOOOO. The version described in this 
paper is the first version, AJKKOIOOOOOO. All short read 
data have been deposited into the Short Read Archive 
under accession number of SR AOS 1254. Genome assem- 
blies and annotation files are also available from the 
GigaScience database [13]. 

Additional files 



Additional file 1: The sequence and analysis of an inbred pig 
genome method. 

Additional file 2: Supplementary tables. 
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